1. Introduction 


Measures of interrater agreement such as Cohen's kappa (and its weighted versions) and intraclass 
correlations are usually defined for ratings of a group of targets (subjects or objects), each 
rated by the same group of raters. This happens, for instance, when the agreement among clinical diagnoses 
provided by several physicians on the same set of patients is analysed to identify the best treatment 
for the patients, or when the agreement among the ratings of educators who assess, on a new ordinal 
rating scale, the language proficiency shown in a corpus of argumentative (written or oral) texts is considered 
to test the reliability of the new scale. 

In other situations, the agreement between ratings is analysed in a group of targets where each 
target is evaluated by a different group of raters, for instance when teachers in a school are 
evaluated by a questionnaire administered to all the pupils (students) in the classroom. In these 
situations, it is important to analyse the reliability of the judgments by a measure of agreement 
between ratings, but since the raters differ from target to target, the ordering of the ratings assigned 
to each target is irrelevant, and the measure can only be defined starting from the single-target level. 

In this paper, an index is proposed to evaluate the agreement between raters for each single target 
rated on an ordinal scale, and also to obtain a global measure of the interrater agreement for the whole 
group of targets evaluated. The main features of the proposal will be illustrated in a study on the 
assessment of the behaviour of student teachers in the classroom. Data were collected in a study 
conducted in 2018 at Roma Tre University with students of the degree course in Formazione 
Primaria, during their internship (“tirocinio”). 


2. Target-specific measures of interrater agreement 


When ratings provided on a quantitative (interval or ratio) scale are analysed in a group of targets 
where each target is evaluated by a different group of raters, a first approach available to measure the 
level of agreement for the whole group of targets is based on the one-way random-effects ANOVA model 
(e.g., Shrout & Fleiss, 1979; McGraw & Wong, 1996). The intraclass correlation (ICC) for this model 
is the between-target variance divided by the sum of the between-target variance and the error 
variance (this sum is the total variance of the ratings). A high value of the ICC indicates good agreement 
among raters, because it is obtained when the between-target variance exceeds the error variance 
(which includes the within-target variance) by a wide margin. However, a low ICC value is not 
necessarily an indication of poor agreement, because a severe restriction in the range of the ratings, 
even when the raters are in good agreement, can cause low values of the between-target variance and 
hence low values of the ICC (the restriction of variance problem, LeBreton et al., 2003). 
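
The restriction of variance problem can be illustrated numerically. The following is a minimal sketch, assuming a balanced design (the same number of ratings per target) and the usual mean-squares estimator of the one-way ICC; the function name `icc_oneway` and the two small data sets are hypothetical and are not taken from the study.

```python
import numpy as np


def icc_oneway(ratings):
    """ICC from a one-way random-effects ANOVA (balanced case).

    ratings: 2-D array, one row per target, one column per rating;
    the raters may differ from row to row.
    """
    x = np.asarray(ratings, dtype=float)
    n_targets, n_raters = x.shape
    target_means = x.mean(axis=1)
    # Between-target and within-target (error) mean squares.
    ms_between = n_raters * np.sum((target_means - x.mean()) ** 2) / (n_targets - 1)
    ms_within = np.sum((x - target_means[:, None]) ** 2) / (n_targets * (n_raters - 1))
    return (ms_between - ms_within) / (ms_between + (n_raters - 1) * ms_within)


# Hypothetical 5-point ratings: the within-target spread is the same in both sets.
wide_range = [[1, 1, 2, 1], [3, 3, 3, 4], [5, 5, 4, 5]]
restricted_range = [[4, 4, 5, 4], [4, 5, 4, 4], [4, 4, 4, 5]]

print(round(icc_oneway(wide_range), 3))
print(round(icc_oneway(restricted_range), 3))
```

In both data sets the raters of each target disagree by at most one point, yet the first ICC is high while the second is not, only because the targets in the second set all receive similar ratings.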

To overcome this problem of the ICC, target-specific measures of interrater agreement were 
proposed that work separately with each target i, using the corresponding row of ratings in the targets × 
raters data matrix. James et al. (1984) proposed the index 

$r_{wg,i} = 1 - s_i^2 / \sigma_E^2$, 

where $s_i^2$ is the observed variance of the ratings in profile i and $\sigma_E^2$ is the variance obtained from a 
theoretical null distribution representing a complete lack of agreement among raters (e.g., the 
uniform distribution). For raters in perfect agreement, we have $s_i^2 = 0$, with a corresponding value 
$r_{wg,i} = 1$. For a total lack of agreement, the observed variance approaches the variance obtained 
from the theoretical null distribution, which leads $r_{wg,i}$ to approach 0. 

A global measure of agreement for the whole group of N targets can be defined as the arithmetic 
average of the $r_{wg,i}$ values ($\bar{r}_{wg} = \frac{1}{N}\sum_{i=1}^{N} r_{wg,i}$). The accuracy of the index depends strongly on the 
specification of the null distribution, and negative values could be obtained. Other possible indices 
for quantitative scales are reviewed, for instance, in LeBreton & Senter (2008). Recently, Bove 
(2022) has considered the normalised standard deviation and the coefficient of variation as possible 
alternatives to the ICC and $r_{wg,i}$. 
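
As an illustration, the index of James et al. (1984) can be sketched as follows, assuming the form one minus the ratio of the observed variance to the null variance, the sample variance (divisor R − 1) as the observed variance, and a discrete uniform null distribution on the scale levels, whose variance is (A² − 1)/12 for A levels. The function name `rwg` and the ratings are hypothetical.

```python
import numpy as np


def rwg(ratings, n_levels):
    """r_wg,i = 1 - s_i^2 / sigma_E^2 for the ratings given to one target."""
    # Assumption: the observed variance is the sample variance (ddof=1).
    s2 = np.var(np.asarray(ratings, dtype=float), ddof=1)
    # Variance of a discrete uniform distribution on 1..n_levels.
    sigma2_null = (n_levels ** 2 - 1) / 12.0
    return 1.0 - s2 / sigma2_null


# Hypothetical ratings on a 5-point scale; group sizes may differ across targets.
profiles = {
    "target_1": [3, 3, 3, 3],
    "target_2": [2, 3, 3, 4, 3],
    "target_3": [1, 5, 1, 5],
}
values = {name: rwg(r, n_levels=5) for name, r in profiles.items()}
rwg_mean = float(np.mean(list(values.values())))  # global measure (arithmetic average)
```

The third profile is polarised at the two ends of the scale, so its observed variance exceeds the uniform null variance and the index becomes negative, which is the drawback mentioned above.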

All the approaches described concern quantitative scales and are not appropriate for ordinal and 
nominal scales. Most of the indices of interrater agreement proposed for ratings on an ordinal scale 
(frequently averages of Cohen's weighted kappa calculated for each of the possible pairs of 
raters) are not suitable for ratings of a group of targets, each rated by a different group of 
raters. 

In order to propose a new index of interrater agreement for ordinal scales, the representation of 
the profile of the ratings for target i on a K-level ordinal scale in Table 1 is considered, 


Table 1: Profile of the ratings for target i on a K-level ordinal scale 

Target    Level 1     Level 2     ...    Level K     Total 
i         $f_{i1}$    $f_{i2}$    ...    $f_{iK}$    $R_i$ 

where $f_{ik}$ is the number of raters assigning level k to target i and $R_i$ is the number of raters who rate 
target i. We propose a general approach that defines target-specific interrater agreement indices as 
normalised indices of variability for the distribution in profile i, according to the measurement level 
of the scale. A global measure of agreement can be defined as the arithmetic average of the target-specific 
values of the indices. 

So, for ordinal scales, the following index of interrater agreement can be considered (by analogy 
with the measure of dispersion for ordinal variables, e.g., Leti, 1983), 

$\delta_i = 1 - D_i / D_{max}$, 

where $F_{ik}$ is the cumulative proportion associated with level k of the scale in the response profile i, 
for k = 1, 2, ..., K, and $D_{max}$ is the maximum of $D_i = 2 \sum_{k=1}^{K-1} F_{ik}(1 - F_{ik})$, that is, $D_{max} = \frac{K-1}{2}$ when $R_i$ 
is even, and $D_{max} = \frac{K-1}{2}\left(1 - \frac{1}{R_i^2}\right)$ when $R_i$ is odd. 
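
A minimal sketch of the computation for a single profile is given here, assuming that the index is one minus the ratio of the dispersion D to its maximum, with the maximum equal to (K − 1)/2 for an even number of raters and (K − 1)/2 · (1 − 1/R²) for an odd number R. The function name `delta_index` and the example profiles are hypothetical.

```python
import numpy as np


def delta_index(freqs):
    """Agreement index for one target from its profile of frequencies.

    freqs: number of raters choosing level 1, ..., K (as in Table 1).
    """
    f = np.asarray(freqs, dtype=float)
    n_levels = f.size
    n_raters = f.sum()
    # Cumulative proportions F_1, ..., F_{K-1} (F_K = 1 contributes nothing).
    cum = np.cumsum(f)[:-1] / n_raters
    dispersion = 2.0 * np.sum(cum * (1.0 - cum))
    d_max = (n_levels - 1) / 2.0
    if int(round(n_raters)) % 2 == 1:
        # With an odd number of raters the two extremes cannot be equally filled.
        d_max *= 1.0 - 1.0 / n_raters ** 2
    return 1.0 - dispersion / d_max


# Hypothetical profiles on a 4-level scale.
print(delta_index([0, 0, 10, 0]))  # all raters choose the same level
print(delta_index([5, 0, 0, 5]))   # raters split between the two extremes
print(delta_index([1, 2, 6, 1]))   # intermediate case
```

Because only the frequencies of the levels enter the computation, the number of raters can differ from target to target.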

The index $\delta_i$ is always nonnegative: $\delta_i = 1$ in the case of maximum agreement and $\delta_i = 0$ 
in the case of maximum disagreement. Some simulations and experience with real applications 
suggest the following thresholds for interpreting the values of the $\delta_i$ index: values 
lower than 0.6 indicate low to moderate agreement, values between 0.6 and 0.8 good agreement, and values 
above 0.8 excellent agreement. The index allows the identification of particular targets for which 
agreement is low, which is not possible with measures like kappa or intraclass correlations. In addition, a 
global measure of agreement can be defined as the arithmetic average of the $\delta_i$ values obtained for 
the N targets ($\bar{\delta} = \frac{1}{N}\sum_{i=1}^{N} \delta_i$). The index is not affected by the possible concentration of ratings in a 
few levels of the scale, as happens for the measures based on the ANOVA approach or for the 
kappa-type indices, and, unlike $r_{wg,i}$, it does not depend on the definition of a null distribution. 
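
The interpretation rule and the global measure can be sketched as follows. The index values are hypothetical, the name `agreement_label` is an assumption, and placing a value exactly equal to 0.8 in the "good" class is one possible reading of the thresholds stated above.

```python
def agreement_label(delta):
    """Map a value of the index to the interpretation classes given in the text."""
    if delta < 0.6:
        return "low to moderate"
    if delta <= 0.8:
        return "good"
    return "excellent"


# Hypothetical target-specific values of the index for N = 4 targets.
deltas = [1.0, 0.72, 0.48, 0.85]

global_delta = sum(deltas) / len(deltas)  # arithmetic average over the targets
labels = [agreement_label(d) for d in deltas]
# Targets (numbered from 1) for which agreement is low.
low_targets = [i for i, d in enumerate(deltas, start=1) if d < 0.6]
```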

In the next section, an application will be shown in which teachers in a school are evaluated by 
a questionnaire administered to all the pupils in the classrooms, so each teacher is evaluated by a 
different group of pupils. In this situation, it is interesting to analyse the level of dispersion of the 
ratings in the classrooms with respect to each question of the questionnaire, in order to investigate 
aspects of the reliability of the ratings. To this end, a matrix A = ($\delta_{ij}$) is defined where each row corresponds to a 
teacher and each column to a question, and the entry $\delta_{ij}$ is the value of $\delta_i$ computed in the classroom 
of teacher i for question j (an example is provided in Table 2). The entries of matrix A can be considered 
as similarities between teachers and questions. The values $\delta_{ij}$ can be depicted in a diagram by the 
unfolding model (originally proposed by Coombs (1964) for rectangular matrices of preference 
scores). The model is 


$f(\delta_{ij}) = \sqrt{\sum_{s=1}^{t} (a_{is} - b_{js})^2} + \varepsilon_{ij}$,    (1) 


where f is a monotone transformation, mapping the similarities $\delta_{ij}$ into a set of dissimilarities 
$p_{ij}$ (e.g., $p_{ij} = 1 - \delta_{ij}$), $a_{is}$ and $b_{js}$ are the coordinates of row (teacher) i and column 
(question) j, respectively, on dimension s in a t-dimensional space, and $\varepsilon_{ij}$ is a residual term. It is worth noting 
that the Euclidean distance model usually used in multidimensional scaling for square dissimilarity 
matrices (e.g., Borg & Groenen, 2005) is a constrained version of model (1), because for each j it 
requires $b_{js} = a_{js}$. 

So, a diagram for the pattern of relationships is obtained where each row (teacher) is represented 
as a point with coordinates $a_{is}$ and each column (question) as a point with coordinates $b_{js}$. In the 
planar representation (t = 2), the distance between row (teacher) i and column (question) j 
approximates the corresponding dissimilarity $p_{ij}$ (so, for instance, we can detect in the diagram both 
the teachers and the questions with low/high levels of agreement of ratings in the classrooms). 
Distances within each of the two sets of row-points and column-points are only implicitly 
defined and do not have corresponding observed entries in the data matrix. Parameters in model 
(1) are estimated by iterative algorithms that, starting from initial estimates $a_{is}^0$, $b_{js}^0$ (initial 
configuration), iteratively decrease a least squares loss function by moving the vectors 
$a_i^0 = (a_{i1}^0, a_{i2}^0, \ldots, a_{it}^0)$ and $b_j^0 = (b_{j1}^0, b_{j2}^0, \ldots, b_{jt}^0)$ until convergence to a minimum. An important 
point is the choice of a good initial configuration, to avoid the problem of local minima. 
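
The iterative scheme can be sketched with plain gradient descent on the raw least squares loss. This is only an illustration under stated assumptions, not the algorithm used for the analysis in this paper (which is not specified here): the transformation f is fixed so that the dissimilarities are entered directly (for instance one minus the agreement value), the initial configuration is random, and the step size is constant. The function name `fit_unfolding` and the data are hypothetical.

```python
import numpy as np


def fit_unfolding(p, t=2, n_iter=500, step=0.01, seed=0):
    """Least squares fit of a metric unfolding model to dissimilarities p (n x m).

    Returns row coordinates a (n x t), column coordinates b (m x t) and the
    loss value recorded at the start of every iteration.
    """
    p = np.asarray(p, dtype=float)
    n, m = p.shape
    rng = np.random.default_rng(seed)
    # Assumption: random initial configuration.
    a = rng.normal(scale=0.5, size=(n, t))
    b = rng.normal(scale=0.5, size=(m, t))
    losses = []
    for _ in range(n_iter):
        diff = a[:, None, :] - b[None, :, :]             # a_i - b_j, shape (n, m, t)
        dist = np.sqrt((diff ** 2).sum(axis=2)) + 1e-12  # row-column distances
        resid = dist - p
        losses.append(float((resid ** 2).sum()))
        # Gradient of sum_ij (dist_ij - p_ij)^2 with respect to a and b.
        weight = (resid / dist)[:, :, None]
        grad_a = 2.0 * (weight * diff).sum(axis=1)
        grad_b = -2.0 * (weight * diff).sum(axis=0)
        a -= step * grad_a
        b -= step * grad_b
    return a, b, losses


# Hypothetical 6 x 4 matrix of agreement values in [0.3, 1].
rng = np.random.default_rng(1)
delta = rng.uniform(0.3, 1.0, size=(6, 4))
a, b, losses = fit_unfolding(1.0 - delta)
```

Since the loss has local minima, in practice such a fit would be repeated from several initial configurations (or started from a rational one), and the solution with the smallest loss would be retained.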


3. Application 


A reduced version for pupils of the Teachers’ Educational Practices Questionnaire (TEP-Q, 
Catalano et al., 2014) was administered to evaluate a group of 24 female student teachers of Roma 
Tre University, during their training (internship) in several primary schools of the Italian region 
Lazio, in school year 2018. The questionnaire consists of the following 12 questions regarding 
the teacher's behaviour in the classroom: “In the class she was relaxed” (Q1), “Before each activity, she 
clearly explained what we had to do” (Q2), “When someone approached her, she turned to look at him” 
(Q3), “She helped us to repeat one thing better if we were not so clear” (Q4), “When someone of us 
was saying something, she interrupted him” (Q5), “When she talked to us, she also used gestures 
(for example, she moved her hands)” (Q6), “She yelled at the class when she got angry” (Q7), “If 
someone of us needed to be consoled, she noticed it, even if he did not tell her” (Q8), “During 
the activities she told us we could help each other” (Q9), “When she was tired, she complained in 
class” (Q10), “She made us do group work” (Q11), “She praised us when we deserved it” (Q12). 
Answers were provided on a 4-level Likert scale (1 = almost never, 4 = almost always). 
For each student teacher, ratings were obtained from the pupils in the classroom (24 school 
classrooms, 418 pupils, 204 females, 214 males, aged between 7 and 12 years). For each student 
teacher i and each question j, the $\delta_{ij}$ value of the index was computed in order to analyse the 
reliability of the ratings provided by the pupils in the school classroom. Table 2 contains the matrix 
of the $\delta_{ij}$ values and, in the last row, the average $\bar{\delta}_j$ for each question. 


Table 2: Values $\delta_{ij}$ obtained for student teachers and questions in the twenty-four school 
classrooms. 




Different levels of reliability characterize the twelve questions. Questions 2 and 10 have high 
values of the average index (0.86 and 0.79, respectively), which means that the pupils usually agree in their 
responses (in several classrooms $\delta_{ij} = 1$). On the contrary, questions 6 and 9 have low values 
of the average index (0.39 and 0.43, respectively), which means that the pupils frequently have different 
opinions about the aspects of the teacher’s behaviour considered in the two questions. The remaining 
questions show low to moderate levels of agreement in the pupils’ responses (average values between 
0.48 and 0.69). 

It is also interesting to analyse the values of the index $\delta_{ij}$ with respect to each student teacher (rows 
of the matrix in Table 2). For instance, student teachers 10, 14, 19 and 21 usually have high levels of 
agreement between the pupils’ responses on the twelve questions, whereas student teacher 20 
has low values of agreement except for questions 2 and 10. 

Model (1) was applied to analyse in a diagram the relationships between student teachers and 
questions. It is assumed that $p_{ij} = 1 - \delta_{ij}$ in model (1), which means that the distances decrease 
as the values $\delta_{ij}$ increase. 

In Figure 1, the solution in 2 dimensions is provided (Stress = 0.29). Distances between 
student teachers and questions represent the level of agreement of the responses to the questions in 
the classroom (the lower the distance, the higher the agreement). Question 2, question 10 and, to a 
lesser extent, question 1 are located in the centre of the diagram, close to many points representing 
teachers, because they usually have high levels of agreement in the responses of the pupils in the 
school classrooms. Questions 6, 9 and 8 have high heterogeneity in many cases, so they are 
positioned far from many student teachers. Considering the student teachers, we observe that 
student teacher 20 is far from most questions because she usually has low values of agreement for 
the ratings obtained in her classroom. On the contrary, student teachers 10, 14 and 21 are near the 
centre of the diagram and close to many questions, a consequence of the homogeneity of the ratings 
obtained on many questions. 
Figure 1: Unfolding of the $\delta_{ij}$ values for student teachers (empty circles) and questions (full black circles) 
in Table 2 (the higher $\delta_{ij}$, the smaller the distance) 


4. Conclusion 


A descriptive approach has been presented for the analysis of the agreement in ratings given to 
a group of targets, where each target is evaluated by a different group of raters. An index of interrater 
agreement defined at the single-target level is proposed for ratings given on an ordinal scale, in a 
manner similar to the definition of the $r_{wg,i}$ index for ratings on a quantitative scale. In addition, a 
measure of agreement for the whole group of targets is obtained as the average of the target-specific 
values. The index presents some advantages with respect to the methods based on ANOVA mean squares, 
like the intraclass correlation, and with respect to many kappa-type indices. Moreover, when the index is 
computed for a group of targets and several questions, it is shown that an unfolding model allows one to 
analyse in a diagram the matrix of the values of the index obtained for each target-question pair. 

The index proposed is mainly considered as a measure of the size of the interrater agreement; 
therefore, developments of this research may concern: 1) an accurate definition of reliable thresholds 
useful for the interpretation of the level of agreement in applications; 2) the study of the sampling 
properties of the index. 